Mining for Relevant Literature from the MEDLINE Database Using Keyword Scoring

Authors

  • Brian Suomela
  • Miguel A. Andrade
  • Enrique Muro
  • Carolina Perez-Iratxeta
  • Gareth Palidwor
Abstract

MOTIVATION: It would be useful to be able to retrieve a ranked list of publications relevant to a topic of interest.

METHODOLOGY: This novel approach uses the semantics built into the Medical Subject Headings (MeSH) hierarchy. A training set of papers is constructed, each annotated with MeSH terms of interest (e.g. Stem Cell or some child in its subtree). Noun frequencies are computed, and the ratio of each noun's frequency in the training set to its frequency in MEDLINE is calculated. These ratios were used to score abstracts in the training set and in an equal number of papers randomly selected from MEDLINE. The performance of the relevance-prediction algorithm was benchmarked by merging the ranked lists of scored papers, selecting increasing numbers of top papers, and counting the Stem Cell papers retrieved.

RESULTS: 16 MeSH keywords were used to build a training set of 99,020 papers known to be relevant to stem cells. 43,328 and 11,773 nouns were extracted from MEDLINE and the training set, respectively. From the training set, 2,078 keywords occurred more than 100 times in the literature, and top-scoring keywords were verified to be relevant to stem cells. In the benchmark, the relevance-prediction algorithm achieved 88% recall and precision for stem cell papers when half of the merged list was selected.

Acknowledgments

I would like to thank my advisors Dr. Miguel Andrade and Dr. Tony White for their guidance and advice; my colleagues Dr. for their patience with my infinite stream of newbie questions; my parents for bringing me into this world and always encouraging me towards becoming an outstanding member of society; and last but not least Ms. Evelyne Chevrier, for inspiring me when I need it most.

Table of Contents

1 Introduction
1.1 Goals of Literature Mining
1.2 Progress in the field of Literature Mining
1.3 Use of Keywords and Hierarchies
1. Methodology
2.1 Gather databases
2.2 Generate a list of stem cell MeSH keywords
2.3 Extract papers about stem cells
2.4 XML document validation
2.5 Moving a range of files
2.6 Robustly extract papers about stem cells
2.7 Combine lists of stem cell papers
2.8 Extract nouns from MEDLINE papers
2.9 Extract nouns from stem cell papers
2.10 Compute the contrast between stem cell papers and the rest of MEDLINE
2.11 Score papers
2.12 Benchmarking
3 Results
3.2 Generate a list of stem cell MeSH keywords
3.7 …
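
The keyword-scoring and benchmarking steps summarized in the abstract can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how nouns are extracted, how the ratios are combined into a paper score, or whether any smoothing is applied. Here each document is assumed to be an already-extracted list of nouns, the score is assumed to be the mean log-ratio of a paper's nouns, and add-one smoothing is assumed for nouns missing from the background sample. The function names `keyword_ratios`, `score`, and `recall_at` are hypothetical.

```python
import math
from collections import Counter


def keyword_ratios(training_docs, background_docs, min_count=1):
    """Ratio of each noun's relative frequency in the training set to its
    relative frequency in the background corpus (assumed definition)."""
    train = Counter(w for doc in training_docs for w in doc)
    back = Counter(w for doc in background_docs for w in doc)
    n_train = sum(train.values())
    n_back = sum(back.values())
    ratios = {}
    for word, count in train.items():
        if count < min_count:
            continue
        # Add-one smoothing (an assumption) avoids division by zero for
        # nouns that never occur in the background sample.
        back_freq = (back[word] + 1) / (n_back + 1)
        ratios[word] = (count / n_train) / back_freq
    return ratios


def score(doc, ratios):
    """Mean log-ratio over the nouns of one abstract (assumed scoring rule).
    Nouns without a ratio contribute log(1.0) = 0."""
    if not doc:
        return 0.0
    return sum(math.log(ratios.get(w, 1.0)) for w in doc) / len(doc)


def recall_at(relevant_docs, random_docs, ratios, fraction=0.5):
    """Merge both sets, rank by score, and return the share of relevant
    papers found in the top `fraction` of the merged list."""
    merged = [(score(d, ratios), True) for d in relevant_docs]
    merged += [(score(d, ratios), False) for d in random_docs]
    merged.sort(key=lambda pair: pair[0], reverse=True)
    cutoff = int(len(merged) * fraction)
    hits = sum(1 for _, is_relevant in merged[:cutoff] if is_relevant)
    return hits / len(relevant_docs)
```

When the relevant and random sets are the same size and the cutoff is half of the merged list, the number of selected papers equals the number of relevant papers, so recall and precision are the same number. That is consistent with the abstract reporting a single 88% figure for both.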

Similar resources

An Effective Path-aware Approach for Keyword Search over Data Graphs

Abstract: Keyword search is known as a user-friendly alternative to structured query languages for retrieving information from graph-structured data. Efficient retrieval of relevant answers to a keyword query and effective ranking of these answers according to their relevance are the two main challenges in keyword search over graph-structured data. In this paper, a novel scoring function is proposed, w...

Credit scoring in banks and financial institutions via data mining techniques: A literature review

This paper presents a comprehensive review of the work done during 2000–2012 on the application of data mining techniques in credit scoring. Yet there isn’t any literature in the field of data mining applications in credit scoring. Using a novel research approach, this paper investigates academic and systematic literature review and includes all of the journals in the Science direct onli...

PPInterFinder—a mining tool for extracting causal relations on human proteins from literature

One of the most common and challenging problems in biomedical text mining is to mine protein-protein interactions (PPIs) from MEDLINE abstracts and full-text research articles, because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented PPInterFinder, a web-based text mining tool to extract human PPIs from biomedical lit...

mirPub: a database for searching microRNA publications

SUMMARY: Identifying, amongst millions of publications available in MEDLINE, those that are relevant to specific microRNAs (miRNAs) of interest based on keyword search faces major obstacles. References to miRNA names in the literature often deviate from standard nomenclature for various reasons, since even the official nomenclature evolves. For instance, a single miRNA name may identify two comp...

Mining literature for protein-protein interactions

MOTIVATION: A central problem in bioinformatics is how to capture information from the vast current scientific literature in a form suitable for analysis by computer. We address the special case of information on protein-protein interactions, and show that the frequencies of words in Medline abstracts can be used to determine whether or not a given paper discusses protein-protein interactions. F...



Publication date: 2004